The goal of this assignment is to give you practice working with accuracy and other recommender system metrics. In this assignment you’re asked to do at least one or (if you like) both of the following:
• Work in a small group, and/or
• Choose a different dataset to work with from your previous projects
As in your previous assignments, compare the accuracy of at least two recommender system algorithms against your offline data.
Implement support for at least one business or user experience goal such as increased serendipity, novelty, or diversity.
Compare and report on any change in accuracy before and after you’ve made the change in #2.
As part of your textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation were possible. Also, briefly propose how you would design a reasonable online evaluation environment.
library(recommenderlab)
library(tidyverse)
data("Jester5k")
dim(Jester5k)
## [1] 5000 100
typeof(Jester5k)
## [1] "S4"
All of the objects in recommenderlab are created under the S4 object-oriented system, which requires a different approach to thinking about how the dataset is manipulated and how models are used.
# Summary of ratings data
jester_df <- as(Jester5k, 'data.frame')
summary(jester_df$rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9.95 -3.06 1.46 0.85 5.10 9.90
The ratings are on a scale of -10 to +10. With a mean rating of 0.85 and a median of 1.46, the typical joke sits only slightly above neutral, which is certainly not good by joke standards.
# Total number of ratings
nratings(Jester5k)
## [1] 362106
# Number of ratings per user
summary(rowCounts(Jester5k))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 36.00 53.00 72.00 72.42 100.00 100.00
# Density: proportion of possible ratings that are present
nratings(Jester5k) / (dim(Jester5k)[1] * dim(Jester5k)[2])
## [1] 0.724212
About 72.4% of the user-joke matrix is filled, which makes this a relatively dense dataset by recommender standards.
jester_matrix <- spread(jester_df, key = "item", value = "rating")
jester_matrix
Using tidyr and the spread function we can get a quick snapshot of what the user-item matrix looks like. However, since Jester5k is a realRatingMatrix S4 object, it takes more tinkering to get it to behave the way you’re normally used to.
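For instance, here is a hedged sketch of the direct coercions recommenderlab supports for realRatingMatrix objects; the exact `as()` targets available can depend on the package version:

```r
library(recommenderlab)
data("Jester5k")

# Coerce the realRatingMatrix to a dense base-R matrix
# (users x jokes, NA for unrated entries)
jester_base <- as(Jester5k, "matrix")
jester_base[1:3, 1:4]

# A sparse representation is also available if memory is a concern
jester_sparse <- as(Jester5k, "dgCMatrix")
```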
jokes_rm <- Jester5k[rowCounts(Jester5k) > 50]
min(rowCounts(jokes_rm))
## [1] 51
hist(getRatings(jokes_rm), breaks=100)
We can see from the histogram that the negative values occur with similar frequencies, while the positive ratings are more frequent but taper off as you approach the maximum rating of 10.
Let’s take a look at the same distribution after normalization.
One by row centering…
hist(getRatings(normalize(jokes_rm, method="center")), breaks = 100)
And the other by Z-score.
hist(getRatings(normalize(jokes_rm, method="Z-score")), breaks = 100)
We can see the peak of the normalized ratings in this reduced set falls at roughly 0 to 1.
Lastly we can take a quick look at how many jokes each user rated and the average rating per joke, by taking the row count and column mean, respectively.
hist(rowCounts(jokes_rm), breaks = 50)
hist(colMeans(jokes_rm), breaks = 20)
As a quick aside, we can find the max value, or “funniest” joke. Here it is…
funniest <- which.max(colMeans(Jester5k))
cat(JesterJokes[funniest])
## A guy goes into confession and says to the priest, "Father, I'm 80 years old, widower, with 11 grandchildren. Last night I met two beautiful flight attendants. They took me home and I made love to both of them. Twice." The priest said: "Well, my son, when was the last time you were in confession?" "Never Father, I'm Jewish." "So then, why are you telling me?" "I'm telling everybody."
We will compare four different recommender models: User-based Collaborative Filtering (UBCF), Item-based Collaborative Filtering (IBCF), Random recommendations (RANDOM), and selections based on Popularity (POPULAR). To each of the four models we will apply six combinations of similarity (cosine and Pearson) and normalization (row center, Z-score, none), for a total of 24 models.
First we build an evaluation scheme on the jokes_rm dataset, splitting it 80% training / 20% test. Jokes often have tough critics, so we will consider a rating of 5 a “good” rating, which sits right around the third quartile (5.10) of the ratings distribution.
eval_set <- evaluationScheme(data = jokes_rm,
method = "split",
train = 0.8,
given = 30,
goodRating = 5)
eval_set
## Evaluation scheme with 30 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=5.000000
## Data set: 3875 x 100 rating matrix of class 'realRatingMatrix' with 314302 ratings.
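The single 80/20 split above is the simplest scheme. As a hedged alternative sketch, recommenderlab also supports k-fold cross-validation, which averages results over k test folds; the parameter values below mirror the split above and are illustrative:

```r
# 5-fold cross-validation variant of the evaluation scheme.
# given = 30 means 30 ratings per test user are revealed to the model;
# the remaining ratings are held out for scoring.
cv_set <- evaluationScheme(data = jokes_rm,
                           method = "cross-validation",
                           k = 5,
                           given = 30,
                           goodRating = 5)
cv_set
```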
Now that the evaluation set is created, we will create and evaluate each of the four subject models, with their six subcomponents, and plot Precision-recall and ROC curves to visually evaluate model performance.
ubcf_models <- list(
ubcf_cos_null = list(name = "UBCF", param = list(method = "cosine", normalize = NULL)),
ubcf_prs_null = list(name = "UBCF", param = list(method = "pearson", normalize = NULL)),
ubcf_cos_center = list(name = "UBCF", param = list(method = "cosine", normalize = "center")),
ubcf_prs_center = list(name = "UBCF", param = list(method = "pearson", normalize = "center")),
ubcf_cos_z = list(name = "UBCF", param = list(method = "cosine", normalize = "Z-score")),
ubcf_prs_z = list(name = "UBCF", param = list(method = "pearson", normalize = "Z-score"))
)
ubcf_eval_results <- evaluate(x = eval_set,
method = ubcf_models,
n = seq(10, 100, 10)
)
## UBCF run fold/sample [model time/prediction time]
## 1 [0.011sec/4.013sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.001sec/4.822sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.04sec/4.03sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.038sec/3.731sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.142sec/3.628sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.139sec/4.278sec]
plot(ubcf_eval_results, "prec/rec", annotate = T, main = "Precision Recall")
title("UBCF Precision-recall")
plot(ubcf_eval_results, annotate = T)
title("UBCF ROC curve")
ibcf_models <- list(
ibcf_cos_null = list(name = "IBCF", param = list(method = "cosine", normalize = NULL)),
ibcf_prs_null = list(name = "IBCF", param = list(method = "pearson", normalize = NULL)),
ibcf_cos_center = list(name = "IBCF", param = list(method = "cosine", normalize = "center")),
ibcf_prs_center = list(name = "IBCF", param = list(method = "pearson", normalize = "center")),
ibcf_cos_z = list(name = "IBCF", param = list(method = "cosine", normalize = "Z-score")),
ibcf_prs_z = list(name = "IBCF", param = list(method = "pearson", normalize = "Z-score"))
)
ibcf_eval_results <- evaluate(x = eval_set,
method = ibcf_models,
n = seq(10, 100, 10)
)
## IBCF run fold/sample [model time/prediction time]
## 1 [0.211sec/0.295sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.35sec/0.22sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.217sec/0.185sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.289sec/0.2sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.298sec/0.23sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.376sec/0.265sec]
plot(ibcf_eval_results, "prec/rec", annotate = T, main = "Precision Recall")
title("IBCF Precision-recall")
plot(ibcf_eval_results, annotate = T)
title("IBCF ROC curve")
random_models <- list(
random_cos_null = list(name = "RANDOM", param = list(method = "cosine", normalize = NULL)),
random_prs_null = list(name = "RANDOM", param = list(method = "pearson", normalize = NULL)),
random_cos_center = list(name = "RANDOM", param = list(method = "cosine", normalize = "center")),
random_prs_center = list(name = "RANDOM", param = list(method = "pearson", normalize = "center")),
random_cos_z = list(name = "RANDOM", param = list(method = "cosine", normalize = "Z-score")),
random_prs_z = list(name = "RANDOM", param = list(method = "pearson", normalize = "Z-score"))
)
random_eval_results <- evaluate(x = eval_set,
method = random_models,
n = seq(10, 100, 10)
)
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.003sec/0.218sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.003sec/0.225sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.003sec/0.208sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.003sec/0.192sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.003sec/0.192sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.003sec/0.19sec]
plot(random_eval_results, "prec/rec", annotate = T, main = "Precision Recall")
title("RANDOM Precision-recall")
plot(random_eval_results, annotate = T)
title("RANDOM ROC curve")
popular_models <- list(
popular_cos_null = list(name = "POPULAR", param = list(method = "cosine", normalize = NULL)),
popular_prs_null = list(name = "POPULAR", param = list(method = "pearson", normalize = NULL)),
popular_cos_center = list(name = "POPULAR", param = list(method = "cosine", normalize = "center")),
popular_prs_center = list(name = "POPULAR", param = list(method = "pearson", normalize = "center")),
popular_cos_z = list(name = "POPULAR", param = list(method = "cosine", normalize = "Z-score")),
popular_prs_z = list(name = "POPULAR", param = list(method = "pearson", normalize = "Z-score"))
)
popular_eval_results <- evaluate(x = eval_set,
method = popular_models,
n = seq(10, 100, 10))
plot(popular_eval_results, "prec/rec", annotate = T, main = "Precision Recall")
title("POPULAR Precision-recall")
plot(popular_eval_results, annotate = T)
title("POPULAR ROC curve")
After looking at the precision-recall and ROC curves for the four methods, it appears some parameter combinations were more accurate than others. Three of the four selected models used Pearson correlation and two of the four used Z-score normalization. The code chunks below provide a little more information about the selected models.
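The averaged counts behind each curve point can also be inspected numerically; `avg()` is recommenderlab’s accessor for evaluation results, and the list element name below matches the model list defined earlier:

```r
# Averaged confusion-matrix counts for the UBCF Pearson/Z-score model,
# one row per recommendation list size n
avg(ubcf_eval_results[["ubcf_prs_z"]])
# Columns include TP, FP, FN, TN, precision, recall, TPR and FPR
```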
# Set the training, known and unknown sets
training_set <- getData(eval_set, "train")
known_set <- getData(eval_set, "known")
unknown_set <- getData(eval_set, "unknown")
ubcf_rec <- Recommender(data = training_set, method = "UBCF")
ibcf_rec <- Recommender(data = training_set, method = "IBCF")
popular_rec <- Recommender(data = training_set, method = "POPULAR")
random_rec <- Recommender(data = training_set, method = "RANDOM")
ubcf_model <- predict(ubcf_rec, known_set, type = "ratingMatrix")
ibcf_model <- predict(ibcf_rec, known_set, type = "ratingMatrix")
popular_model <- predict(popular_rec, known_set, type = "ratingMatrix")
random_model <- predict(random_rec, known_set, type = "ratingMatrix")
I relied on the ROC curves over the precision-recall curves since it seems like we have a fairly balanced dataset.
# UBCF Pearson Z-score
ubcf_prs_z_rec <- Recommender(data = training_set, method = "UBCF", parameter = list(method = "pearson", normalize = "Z-score"))
# IBCF Pearson Row Centering
ibcf_prs_c_rec <- Recommender(data = training_set, method = "IBCF", parameter = list(method = "pearson", normalize = "center"))
# POPULAR Cosine similarity Z-score
popular_cos_z_rec <- Recommender(data = training_set, method = "POPULAR", parameter = list(method = "cosine", normalize = "Z-score"))
## Available parameter (with default values):
## normalize = center
## aggregationRatings = colMeans (default S4 method)
## aggregationPopularity = colSums (default S4 method)
## verbose = FALSE
# RANDOM Pearson WITHOUT normalization
random_prs_n_rec <- Recommender(data = training_set, method = "RANDOM", parameter = list(method = "pearson", normalize = NULL))
ubcf_prs_z_model <- predict(ubcf_prs_z_rec, known_set, type = "ratingMatrix")
ibcf_prs_c_model <- predict(ibcf_prs_c_rec, known_set, type = "ratingMatrix")
popular_cos_z_model <- predict(popular_cos_z_rec, known_set, type = "ratingMatrix")
random_prs_n_model <- predict(random_prs_n_rec, known_set, type = "ratingMatrix")
## Accuracy
The results below compare the default models against the models selected with specific similarity and normalization methods. UBCF with Pearson similarity and Z-score normalization had the lowest error rate across all three measures.
error <- rbind(
UBCF = calcPredictionAccuracy(ubcf_model, unknown_set),
UBCF_prs_z = calcPredictionAccuracy(ubcf_prs_z_model, unknown_set),
IBCF = calcPredictionAccuracy(ibcf_model, unknown_set),
IBCF_prs_c = calcPredictionAccuracy(ibcf_prs_c_model, unknown_set),
POPULAR = calcPredictionAccuracy(popular_model, unknown_set),
POPULAR_cos_z = calcPredictionAccuracy(popular_cos_z_model, unknown_set),
RANDOM = calcPredictionAccuracy(random_model, unknown_set),
RANDOM_prs_n = calcPredictionAccuracy(random_prs_n_model, unknown_set)
)
error
## RMSE MSE MAE
## UBCF 4.425417 19.58431 3.491192
## UBCF_prs_z 4.368536 19.08411 3.406390
## IBCF 4.922449 24.23050 3.930794
## IBCF_prs_c 4.427983 19.60703 3.456311
## POPULAR 4.471660 19.99574 3.555921
## POPULAR_cos_z 4.470542 19.98575 3.519076
## RANDOM 6.365253 40.51645 4.939390
## RANDOM_prs_n 6.367879 40.54988 4.949997
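Rather than reading the winner off the table by eye, the best row can be selected programmatically; a small sketch over the `error` matrix built above:

```r
# Model with the lowest RMSE; error was built with rbind(), so
# rownames() carries the model labels
best_model <- rownames(error)[which.min(error[, "RMSE"])]
best_model
## [1] "UBCF_prs_z"
```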
Another item that could be tested is a hybrid recommender system, which can combine features from one or more recommenders on a weighted basis to obtain some user/item accuracy coupled with novelty and serendipity from the popularity and random models. I ran into datatype inconsistencies when testing the hybrid system, which is a class in recommenderlab; with a little more time I could have created and evaluated that as well.
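As a sketch of that idea, recommenderlab’s HybridRecommender() combines fitted recommenders with a weight vector; the weights below are illustrative, not tuned:

```r
# Weighted hybrid: mostly UBCF for accuracy, with POPULAR and RANDOM
# mixed in for novelty and serendipity
hybrid_rec <- HybridRecommender(
  Recommender(training_set, method = "UBCF"),
  Recommender(training_set, method = "POPULAR"),
  Recommender(training_set, method = "RANDOM"),
  weights = c(0.6, 0.3, 0.1)
)
hybrid_top10 <- predict(hybrid_rec, known_set, type = "topNList", n = 10)
```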
The main difference between offline and online evaluation is how accuracy is tested. Offline, as we used here, the recommendations are tested against some “unknown” portion of a held-out test set, whereas online that unknown group could be a live user being given a recommendation on the spot. The system can then learn from click-through rates, which would further improve accuracy and tie together even more interesting recommendations.
As shown here, one can put a set of models in a list, run and evaluate them all, and choose a model for production. This type of workflow will continue to get easier with more computing power, but one must slow down and really think through what the goals of the system are and what kind of experience you want the end user to have.